Conversation

@Yicong-Huang Yicong-Huang commented Jan 7, 2026

What changes were proposed in this pull request?

Add tests for PyArrow's pa.array type inference behavior. These tests monitor upstream PyArrow behavior to ensure PySpark's assumptions remain valid across versions.

The tests cover type inference across input categories:

  1. Nullable data: inputs containing None values
  2. Plain Python instances: list, tuple, dict (struct)
  3. Pandas instances: numpy-backed Series, nullable extension types, ArrowDtype
  4. NumPy arrays: all numeric dtypes, datetime64, timedelta64
  5. Nested types: list of list, list of struct, struct of struct, struct of list
  6. Explicit type specification: large_list, fixed_size_list, map_, large_string, large_binary

Types tested include:

| Category | Types covered |
| --- | --- |
| Primitive | int8/16/32/64, uint8/16/32/64, float16/32/64, bool, string, binary |
| Temporal | date32, timestamp (s/ms/us/ns), time64, duration (s/ms/us/ns) |
| Decimal | decimal128 |
| Nested | list_, struct, map_ (explicit only) |
| Large variants | large_list, large_string, large_binary (explicit only) |

Pandas extension types tested:

  • Nullable types: pd.Int8Dtype() ... pd.Int64Dtype(), pd.UInt8Dtype() ... pd.UInt64Dtype(), pd.Float32Dtype(), pd.Float64Dtype(), pd.BooleanDtype(), pd.StringDtype()
  • PyArrow-backed: pd.ArrowDtype(pa.int64()), pd.ArrowDtype(pa.float64()), pd.ArrowDtype(pa.large_string()), etc.

Why are the changes needed?

This is part of SPARK-54936 to monitor behavior changes from upstream dependencies. By testing PyArrow's type inference behavior, we can detect breaking changes when upgrading PyArrow versions.

Does this PR introduce any user-facing change?

No. This PR only adds tests.

How was this patch tested?

New unit tests added:

```
python -m pytest python/pyspark/tests/upstream/pyarrow/test_pyarrow_type_inference.py -v
```

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot commented Jan 7, 2026

JIRA Issue Information

=== Sub-task SPARK-54938 ===
Summary: Add tests for pa.array type inference
Assignee: Yicong Huang
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@HyukjinKwon (Member) commented:

cc @zhengruifeng

@zhengruifeng (Contributor) left a comment:

Thanks so much, it is much cleaner.

Inspired by https://github.com/apache/spark/pull/53727/changes, I think we need to also test the following cases:

1. strings: non-English values
2. integrals: min/max values, and make them overflow
3. floats: nan, -inf, inf
4. time: Unix epoch, min/max values

@Yicong-Huang force-pushed the SPARK-54938/test/add-tests-for-pa-array-type-inference branch from efbc505 to 905b616 on January 9, 2026 at 21:34
@Yicong-Huang force-pushed the same branch from 905b616 to f12b2a0 on January 9, 2026 at 21:37
# unittests for upstream projects
"pyspark.tests.upstream.pyarrow.test_pyarrow_ignore_timezone",
"pyspark.tests.upstream.pyarrow.test_pyarrow_scalar_type_inference",
"pyspark.tests.upstream.pyarrow.test_pyarrow_type_inference",
Contributor:
let's rename the file test_pyarrow_array_type_inference

Contributor Author:
renamed.

@zhengruifeng (Contributor) left a comment:
otherwise, LGTM

@zhengruifeng (Contributor) commented:

thanks, merged to master
